Code to clean the data file-by-file

Importing the necessary libraries

In [1]:
import pandas as pd
import csv
import string
import re
import nltk

nltk.download('stopwords')
nltk.download('names')
from nltk.corpus import stopwords
from nltk.corpus import names
from nltk import word_tokenize
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Aruna\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package names to
[nltk_data]     C:\Users\Aruna\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!
In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

%matplotlib inline
pd.set_option('display.max_colwidth', 150)

(A) Read the CSV File

In [3]:
df = pd.read_csv("C:\\Users\\Aruna\\Documents\\input\\Amazon VPC.csv")

df['description'] = df['description'].apply(lambda x: " ".join(x for x in str(x).split())) # converting to string
 
df.head(10)
Out[3]:
id label description
0 6967 Amazon VPC Instances in VPC cannot connect to the internet without an elastic ip When trying to do a yum update I get this. root@com01 ~# yum update Loaded p...
1 6967 Amazon VPC There are three ways to access the Internet. You can use an Internet Gateway along with an Elastic IP for each instance, set up a NAT instance, or...
2 6967 Amazon VPC Oh that could explain a lot. But not how I could connect earlier. That I don't understand. So what I need is a NAT device for those instances that...
3 6967 Amazon VPC IGW along does not allow you to gain access to the Internet. You must have EIP assigned to your instance or have all your instances behind NAT Ins...
4 6967 Amazon VPC Assuming you don't want to assign public IP addresses to these instances, you need to set up a NAT instance. It is a regular EC2 instance running ...
5 6967 Amazon VPC Hi, In your response you state the following "If you have all instances in a public subnet right now, I'm afraid you will need to make some change...
6 6967 Amazon VPC Hi Matthew, How did you solve your problem. I am facing similar situation as yours. I have a VPC created with the option "VPC with a Single Public...
7 6967 Amazon VPC Hi Riddhi, I also face the same problem with a similar configuration: private instance within a single public subnet, did you find a solution to g...
8 6967 Amazon VPC For any instances in a 'private' subnet, which can be defined as an instance in a subnet that does not have a route to the internet (0.0.0.0/0) vi...
9 6966 Amazon VPC Route Selection in Transit Gateway If we've got a Transit Gateway configured and we have multiple VPN connections into the TGW (from multiple rout...
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21629 entries, 0 to 21628
Data columns (total 3 columns):
id             21629 non-null int64
label          21629 non-null object
description    21629 non-null object
dtypes: int64(1), object(2)
memory usage: 507.0+ KB

Check out one sample post:

In [16]:
p = 2000

df['description'][p]
Out[16]:
"For instances without a public IP on private subnets to get outbound internet you need a NAT and IGW ( internet gateway) components . They will not have originating from the outside inbound access. Only instances with public IPs ( or via a LB) have that. You also have to set up the routing tables for those NAT'd subnets to go via the NAT"

Top 30 words + frequency of each:

In [17]:
pd.Series(' '.join(df['description']).split()).value_counts()[:30]
Out[17]:
the         68465
to          53237
I           31850
a           29658
and         24929
in          21744
is          20277
VPC         19169
you         18213
for         16193
have        14693
that        14532
on          13670
of          12885
this        11419
with        10903
it          10695
your        10582
can          9515
be           9513
from         9470
my           9277
not          8648
instance     8401
VPN          8387
are          7993
default      7787
an           7271
as           7163
IP           6539
dtype: int64
In [18]:
print("There are totally", df['description'].apply(lambda x: len(x.split(' '))).sum(), "words before cleaning.")
There are totally 1697043 words before cleaning.

(B) Text Pre-processing

In [19]:
STOPWORDS = stopwords.words('english')
my_stop_words = ["hi", "hello", "regards", "thank", "thanks", "regard", "best", "wishes", "hey", "amazon", "aws", "s3",
"elastic", "beanstalk", "rds", "ec2", "lambda", "cloudfront", "cloud", "front", "vpc", "sns", "me",
"january", "february", "march", "april", "may", "june", "july", "august", "september", "october", 
"november", "december", "jan", "feb", "mar", "apr", "jun", "jul", "aug", "sep", "sept", "oct", "nov",
"dec", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday", "mon", "tue",
"wed", "thu", "fri", "sat", "sun", "ain't", "aren't", "can't", "can't've", "'cause", "could've", "couldn't",
"couldn't've", "didn't", "doesn't", "don't", "hadn't", "hadn't've", "hasn't", "haven't", "he'd", "he'd've",
"he'll", "he'll've", "he's", "how'd", "how'd'y", "how'll", "how's", "i'd", "i'd've", "i'll", "i'll've", "i'm",
"i've", "isn't", "it'd", "it'd've", "it'll", "it'll've", "it's", "let's", "mayn't", "might've", "mightn't",
"mightn't've", "must've", "mustn't", "mustn't've", "needn't", "needn't've", "oughtn't", "oughtn't've", "shan't",
"sha'n't", "shan't've", "she'd", "she'd've", "she'll", "she'll've", "she's", "should've", "shouldn't", "shouldn't've",
"so've", "so's", "that'd", "that'd've", "that's", "there'd", "there'd've", "there's", "they'd", "they'd've", "they'll",
"they'll've", "they're", "they've", "to've", "wasn't", "we'd", "we'd've", "we'll", "we'll've", "we're", "we've",
"weren't", "what'll", "what'll've", "what're", "what's", "what've", "when's", "when've", "where'd", "where's",
"where've", "who'll", "who'll've", "who's", "who've", "why's", "why've", "will've", "won't", "won't've", "would've",
"wouldn't", "wouldn't've", "yall", "yalld", "yalldve", "yallre", "yallve", "youd", "youdve", "youll",
"youllve", "youre", "youve", "do", "did", "does", "had", "have", "has", "could", "can", "as", "is",
"shall", "should", "would", "will", "you", "me", "please", "know", "who", "we", "was", "were", "edited", "by", "pm"]

name = names.words()
STOPWORDS.extend(my_stop_words)
STOPWORDS.extend(name)

REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|@,:;#+?]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z - _.]+')
REMOVE_HTML_RE = re.compile(r'<.*?>')
REMOVE_HTTP_RE = re.compile(r'http\S+')

STOPWORDS = [BAD_SYMBOLS_RE.sub('', x) for x in STOPWORDS]

Convert to lowercase

In [20]:
df['description'] = df['description'].apply(lambda x: " ".join(x.lower() for x in str(x).split(" ")))

df['description'][p]
Out[20]:
"for instances without a public ip on private subnets to get outbound internet you need a nat and igw ( internet gateway) components . they will not have originating from the outside inbound access. only instances with public ips ( or via a lb) have that. you also have to set up the routing tables for those nat'd subnets to go via the nat"

Remove all HTML tags

In [21]:
df['description'] = df['description'].apply(lambda x: " ".join(REMOVE_HTML_RE.sub(' ', x) for x in str(x).split()))

df['description'][p]
Out[21]:
"for instances without a public ip on private subnets to get outbound internet you need a nat and igw ( internet gateway) components . they will not have originating from the outside inbound access. only instances with public ips ( or via a lb) have that. you also have to set up the routing tables for those nat'd subnets to go via the nat"
In [22]:
df['description'] = df['description'].apply(lambda x: " ".join(REMOVE_HTTP_RE.sub(' ', x) for x in str(x).split()))

df['description'][p]
Out[22]:
"for instances without a public ip on private subnets to get outbound internet you need a nat and igw ( internet gateway) components . they will not have originating from the outside inbound access. only instances with public ips ( or via a lb) have that. you also have to set up the routing tables for those nat'd subnets to go via the nat"

Replace certain characters by space (quotation marks, parantheses etc)

In [23]:
df['description'] = df['description'].apply(lambda x: " ".join(REPLACE_BY_SPACE_RE.sub(' ', x) for x in str(x).split()))

df['description'][p]
Out[23]:
"for instances without a public ip on private subnets to get outbound internet you need a nat and igw   internet gateway  components . they will not have originating from the outside inbound access. only instances with public ips   or via a lb  have that. you also have to set up the routing tables for those nat'd subnets to go via the nat"

Remove any unwanted symbols (like $, @ etc)

In [24]:
df['description'] = df['description'].apply(lambda x: " ".join(BAD_SYMBOLS_RE.sub('', x) for x in str(x).split()))

df['description'][p]
Out[24]:
'for instances without a public ip on private subnets to get outbound internet you need a nat and igw internet gateway components . they will not have originating from the outside inbound access. only instances with public ips or via a lb have that. you also have to set up the routing tables for those natd subnets to go via the nat'

Remove trailing punctuation marks and any symbol patterns

In [25]:
df['description'] = df['description'].apply(lambda x: " ".join(x.strip('.') for x in x.split()))
df['description'] = df['description'].apply(lambda x: " ".join(x.strip('-') for x in x.split()))
df['description'] = df['description'].apply(lambda x: " ".join(x.strip('_') for x in x.split()))
df['description'][p]
Out[25]:
'for instances without a public ip on private subnets to get outbound internet you need a nat and igw internet gateway components they will not have originating from the outside inbound access only instances with public ips or via a lb have that you also have to set up the routing tables for those natd subnets to go via the nat'

Remove any numbers

In [26]:
df['description'] = df['description'].apply(lambda x: " ".join(x for x in x.split() if not x.isdigit()))

df['description'][p]
Out[26]:
'for instances without a public ip on private subnets to get outbound internet you need a nat and igw internet gateway components they will not have originating from the outside inbound access only instances with public ips or via a lb have that you also have to set up the routing tables for those natd subnets to go via the nat'

Remove the stop words

In [27]:
df['description'] = df['description'].apply(lambda x: " ".join(x for x in x.split() if x not in STOPWORDS
                                                               and len(x) > 1))

df['description'][p]
Out[27]:
'instances without public private subnets get outbound internet need nat igw internet gateway components originating outside inbound access instances public ips via lb also set routing tables natd subnets via nat'

Results after cleaning data:

In [28]:
df.head()
Out[28]:
id label description
0 6967 Amazon VPC instances cannot connect internet without trying yum update get root com01 yum update loaded plugins fastestmirror priorities security updatemotd ...
1 6967 Amazon VPC three ways access internet use internet gateway along instance set nat instance route everything hardware vpn connection back premises instance ru...
2 6967 Amazon VPC oh explain lot connect earlier understand need nat device instances want instances runs igw default gw
3 6967 Amazon VPC igw along allow gain access internet must eip assigned instance instances behind nat instance procedure setup nat instance provided
4 6967 Amazon VPC assuming want assign public addresses instances need set nat instance regular instance running special nat first need least two subnets order make...

Top 30 words + frequency of each:

In [29]:
pd.Series(' '.join(df['description']).split()).value_counts()[:30]
Out[29]:
instance      12038
default       10590
vpn           10265
subnet         7462
private        5408
instances      5401
gateway        5170
route          5167
network        5150
one            4684
traffic        4481
set            4369
public         4344
using          4232
need           4194
region         4113
server         4035
connection     4027
nat            4009
use            3932
tunnel         3919
address        3884
internet       3792
security       3787
new            3723
access         3571
create         3384
ms             3252
like           3180
get            3140
dtype: int64
In [30]:
print("There are totally", df['description'].apply(lambda x: len(x.split(' '))).sum(), "words after cleaning.")
There are totally 816408 words after cleaning.

(C) Write to CleanText.csv

In [31]:
with open('C:\\Users\\Aruna\\Documents\\ACMS-IID\\input\\CleanText.csv', 'a', encoding='utf-8', newline='') as csvfile:
    writer = csv.writer(csvfile)
    # writer.writerow(['id', 'label', 'description'])
    for i in range(0, len(df['description'])):
        if len(df['description'][i]) > 1:
            writer.writerow([df['id'][i], df['label'][i], df['description'][i]])

(D) Generate the word cloud

In [32]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize  = (100,100))
wordcloud = WordCloud(max_font_size = 20, max_words = 20, background_color = "white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[32]:
(-0.5, 399.5, 199.5, -0.5)
In [33]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize  = (100,100))
wordcloud = WordCloud(max_font_size = 20, max_words = 50, background_color = "white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[33]:
(-0.5, 399.5, 199.5, -0.5)
In [34]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize  = (100,100))
wordcloud = WordCloud(max_font_size = 20, max_words = 100, background_color = "white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[34]:
(-0.5, 399.5, 199.5, -0.5)
In [35]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize  = (100,100))
wordcloud = WordCloud(max_font_size = 20, max_words = 500, background_color = "white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[35]:
(-0.5, 399.5, 199.5, -0.5)
In [36]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize  = (100,100))
wordcloud = WordCloud(max_font_size = 20, max_words = 1000, background_color = "white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[36]:
(-0.5, 399.5, 199.5, -0.5)
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]: